Objective

The analyses performed in this example are intended to illustrate the process of estimating survival times in hybrid decision networks using the HydeNet package. It also gives a brief explanation of how the predicted survival times are calculated. The analyses are not particularly practical and should not be taken seriously.

Introduction

Beginning in HydeNet 0.10.4, you may include survival analysis nodes within a network. These nodes have the same appearance and behavior as other nodes, and only require that you use a parametric survival model with the function survival::survreg. The only significant difference in behavior is that you must explicitly define the nodeFormula argument in setNode.

To explore the development of survival models in HydeNet, we may use the veteran dataset from the survival package. The veteran dataset comes from a randomized trial of lung patients for two types of treatment. A sample of the data, after some formatting, are presented in Table 1.

library(survival)
library(dplyr)
data(veteran)

veteran %<>%
  mutate(trt = factor(trt, 
                      levels = 1:2,
                      labels = c("standard", "test")),
         prior = factor(prior,
                        c(0, 10),
                        c("no", "yes")))
head(veteran) %>%
  dust(caption = "Sample of Veterans Data") %>%
  hyde_medley() 

Table 1: Sample of Veterans Data
trt celltype time status karno diagtime age prior
standard squamous 72 1 60 7 69 no
standard squamous 411 1 70 5 64 yes
standard squamous 228 1 60 3 38 no
standard squamous 126 1 60 9 63 yes
standard squamous 118 1 70 11 65 yes
standard squamous 10 1 20 5 49 no


A quick univariable analysis, presented in Table 2, illustrates that is apparent balance between the standard and test groups. The balance by celltype is a little weak, but probably wasn’t considered in the randomization anyway.

Table 2: Evaluation of the Balance of Randomization

    standard test    
Factor Total N Statistics N Statistics Test Statistic p-value
time   a 137 69 115.14 (90.15, 140.53) 68 128.21 (87.78, 177.93) 2564 0.35W
celltype   b 137         7.04 0.071C
  squamous 35 15 42.86 20 57.14      
  smallcell 48 30 62.5 18 37.5      
  adeno 27 9 33.33 18 66.67      
  large 27 15 55.56 12 44.44      
karno   a 137 69 59.2 (54.64, 63.41) 68 57.93 (53, 62.6) 2419.5 0.75W
diagtime   a 137 69 8.65 (6.84, 10.93) 68 8.9 (6.4, 12.21) 2564.5 0.35W
age   a 137 69 57.51 (54.93, 60.1) 68 59.12 (56.51, 61.59) 2175 0.46W
prior   b 137         0.02 0.89C
  no 97 48 49.48 49 50.52      
  yes 40 21 52.5 19 47.5      


a Mean (95% Bootstrap CI); b Percentage
C: Pearson’s Chi-squared test
W: Wilcoxon rank sum test with continuity correction

Representation of the simple model

The simplest representation of these data in HydeNet is developed as a single model with node time where all of the independent variables are parents.

library(HydeNet)
library(magrittr)
library(ggplot2)
Lung_simple <- 
  HydeNetwork( ~ time | trt * celltype * karno * diagtime * age * prior,
               data = veteran)
plot(Lung_simple)

Looking at the parameter specifications for the time node shows us that HydeNet assumes time is to be calculated from a linear regression. We need to explicitly tell HydeNet to use a parametric survival regression.

print(Lung_simple, time)
## A Probabilistic Graphical Network
## Has data attached: Yes
## 
## time | trt * celltype * karno * diagtime * age * prior
## dnorm(mu = fromData, tau = fromData)
## lm: time ~ trt + celltype + karno + diagtime + age + prior
Lung_simple %<>%
  setNode(time, 
          nodeType = "dnorm",
          nodeFitter = "survreg",
          nodeFormula = Surv(time, status) ~ trt + celltype + karno + 
                          diagtime + age + prior,
          fitterArgs = list(dist = "weibull", scale = 1),
          mu = fromData(),
          tau = fromData())

Notice that we declared the nodeType for the time node to be "dnorm", just as we would have for linear regression. We are able to do this because the outcome we are predicting is time, and the prediction has an approximately normal distribution. The standard error of the predictions is calculated, in matrix notation, as

\[ X \cdot VV \cdot X' \]

Where \(X\) is the matrix of independent variables, \(X'\) is it’s transpose, and \(VV\) is the variance covariance matrix. These matrices are determined from the model built by survreg and written into the JAGS model. As we’re about to see, this adds quite a bit of additional text to our JAGS code. But for the most part, this additional code is transparent to our analysis. The additions appear in the code as vvmat.time, xmat.time, and xmatprime.time.

writeNetworkModel(Lung_simple, pretty = TRUE)
## model{
##    vv.time[1,1] <- 0.49566; vv.time[1,2] <- -0.00209; vv.time[1,3] <- -0.04103; vv.time[1,4] <- -0.0343; vv.time[1,5] <- -0.0311; vv.time[1,6] <- -0.00232; vv.time[1,7] <- -0.00113; vv.time[1,8] <- -0.00536; vv.time[1,9] <- -0.00473; vv.time[2,1] <- -0.00209; vv.time[2,2] <- 0.03946; vv.time[2,3] <- 0.01686; vv.time[2,4] <- -0.0022; vv.time[2,5] <- 0.01126; vv.time[2,6] <- -3e-05; vv.time[2,7] <- -0.00015; vv.time[2,8] <- -0.00036; vv.time[2,9] <- -0.0036; vv.time[3,1] <- -0.04103; vv.time[3,2] <- 0.01686; vv.time[3,3] <- 0.0687; vv.time[3,4] <- 0.03624; vv.time[3,5] <- 0.0374; vv.time[3,6] <- 0.00022; vv.time[3,7] <- -0.00043; vv.time[3,8] <- -0.00031; vv.time[3,9] <- 0.00985; vv.time[4,1] <- -0.0343; vv.time[4,2] <- -0.0022; vv.time[4,3] <- 0.03624; vv.time[4,4] <- 0.07608; vv.time[4,5] <- 0.03194; vv.time[4,6] <- 0.00015; vv.time[4,7] <- -1e-05; vv.time[4,8] <- -0.00019; vv.time[4,9] <- 0.01113; vv.time[5,1] <- -0.0311; vv.time[5,2] <- 0.01126; vv.time[5,3] <- 0.0374; vv.time[5,4] <- 0.03194; vv.time[5,5] <- 0.07432; vv.time[5,6] <- -7e-05; vv.time[5,7] <- -0.00014; vv.time[5,8] <- -9e-05; vv.time[5,9] <- 0.00282; vv.time[6,1] <- -0.00232; vv.time[6,2] <- -3e-05; vv.time[6,3] <- 0.00022; vv.time[6,4] <- 0.00015; vv.time[6,5] <- -7e-05; vv.time[6,6] <- 3e-05; vv.time[6,7] <- 1e-05; vv.time[6,8] <- 1e-05; vv.time[6,9] <- -0.00016; vv.time[7,1] <- -0.00113; vv.time[7,2] <- -0.00015; vv.time[7,3] <- -0.00043; vv.time[7,4] <- -1e-05; vv.time[7,5] <- -0.00014; vv.time[7,6] <- 1e-05; vv.time[7,7] <- 8e-05; vv.time[7,8] <- 1e-05; vv.time[7,9] <- -0.00089; vv.time[8,1] <- -0.00536; vv.time[8,2] <- -0.00036; vv.time[8,3] <- -0.00031; vv.time[8,4] <- -0.00019; vv.time[8,5] <- -9e-05; vv.time[8,6] <- 1e-05; vv.time[8,7] <- 1e-05; vv.time[8,8] <- 8e-05; vv.time[8,9] <- 4e-05; vv.time[9,1] <- -0.00473; vv.time[9,2] <- -0.0036; vv.time[9,3] <- 0.00985; vv.time[9,4] <- 0.01113; vv.time[9,5] <- 0.00282; vv.time[9,6] <- -0.00016; vv.time[9,7] <- -0.00089; vv.time[9,8] <- 4e-05; vv.time[9,9] <- 0.05147
##    xmat.time[1,1] <- 1; xmat.time[1,2] <- trt; xmat.time[1,3] <- (celltype == 2); xmat.time[1,4] <- (celltype == 3); xmat.time[1,5] <- (celltype == 4); xmat.time[1,6] <- karno; xmat.time[1,7] <- diagtime; xmat.time[1,8] <- age; xmat.time[1,9] <- prior
##    xmatprime.time[1,1] <- 1; xmatprime.time[2,1] <- trt; xmatprime.time[3,1] <- (celltype == 2); xmatprime.time[4,1] <- (celltype == 3); xmatprime.time[5,1] <- (celltype == 4); xmatprime.time[6,1] <- karno; xmatprime.time[7,1] <- diagtime; xmatprime.time[8,1] <- age; xmatprime.time[9,1] <- prior   
## 
##    time ~ dnorm(exp(0.00611 * age + -1.11312 * (celltype == 3) + -0.37722 * (celltype == 4) + -0.82024 * (celltype == 2) + -3e-04 * diagtime + 0.03062 * karno + -0.04948 * (prior == 2) + -0.21957 * (trt == 2) + 3.18861),
##                 1 / (xmat.time[,] %*% vv.time[,] %*% xmatprime.time[,]))
##    pi.trt[1] <- 0.50365; pi.trt[2] <- 0.49635 
## trt ~ dcat(pi.trt)
##    pi.celltype[1] <- 0.25547; pi.celltype[2] <- 0.35036; pi.celltype[3] <- 0.19708; pi.celltype[4] <- 0.19708 
## celltype ~ dcat(pi.celltype)
##    karno ~ dnorm(58.56934, 0.0499)
##    diagtime ~ dnorm(8.77372, 0.09423)
##    age ~ dnorm(58.30657, 0.09486)
##    pi.prior[1] <- 0.70803; pi.prior[2] <- 0.29197 
## prior ~ dcat(pi.prior)
## }

From here, we perform our analysis as usual. We compile the network model, generate a posterior distribution, and evaluate the results. As seen below, there doesn’t appear to be much of a difference in survival time between the treatments.

Post_simple <- 
  compileJagsModel(Lung_simple) %>%
  HydePosterior(variable.names = c("time", "trt", "celltype", "karno", "diagtime", 
                                   "age", "prior"),
                n.iter = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 0
##    Unobserved stochastic nodes: 7
##    Total graph size: 213
## 
## Initializing model
options(pixie_count = 2)
Post_simple %>%
  group_by(trt) %>%
  do(broom::tidy(as.data.frame(.))) %>%
  ungroup() %>%
  filter(column == "time") %>%
  dust(caption = "Comparison of Survival Times Using Basic Survival Model") %>%
  sprinkle(round = 3) %>%
  hyde_medley()
## Warning: attributes are not identical across measure variables; they will
## be dropped

Table 3: Comparison of Survival Times Using Basic Survival Model
trt column n mean sd median trimmed mad min max range skew kurtosis se
standard time 463 127.432 56.851 106.899 122.425 55.132 46.955 326.07 279.115 0.689 -0.559 2.642
test time 537 99.993 45.106 82.668 95.705 40.639 36.318 222.853 186.534 0.718 -0.631 1.946


ggplot(Post_simple,
       mapping = aes(x = trt,
                     y = time,
                     colour = trt)) + 
  geom_boxplot()

For the purposes of comparison, let’s run the model again, but let’s observer the values for the first patient in the veteran dataset to see how HydeNet compares.

fit <- survreg(Surv(time, status) ~ trt + celltype + karno + 
                  diagtime + age + prior,
               data = veteran,
               dist = "weibull",
               scale = 1)
predict(fit)[1]
## [1] 231.6903
veteran[1, ]
##        trt celltype time status karno diagtime age prior
## 1 standard squamous   72      1    60        7  69    no
compileJagsModel(Lung_simple, 
                 data = list(trt = "standard",
                             celltype = "squamous",
                             karno = 60,
                             diagtime = 7,
                             age = 69,
                             prior = "yes")) %>%
  HydePosterior(variable.names = "time",
                n.iter = 1000) %>%
  summarise(median = median(time))
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 6
##    Unobserved stochastic nodes: 1
##    Total graph size: 212
## 
## Initializing model
##     median
## 1 220.4604

A More Complex Model

Now, for the sake of illustration, let’s abandon the fact that this lung study was randomized. Instead, we will look at the data as if it were an observational study. Furthermore, let us assume we believe the following:

With these beliefs, we construct our network as follows:

Lung <- HydeNetwork(~ diagtime | prior * karno + 
                      age + 
                      celltype + 
                      karno | celltype + 
                      trt | celltype * karno * diagtime * age * prior + 
                      time | trt * celltype * karno * diagtime * age * prior,
                    data = veteran)

plot(Lung)

Before proceeding, it would be prudent to determine if HydeNet’s default models will be suitable to the data. The results are not displayed, but the code is provided.

#* Plot settings
par(mfrow=c(2, 2))

#* Default karno model is acceptable.
fit.karno <- lm(karno ~ celltype, data = veteran)
plot(fit.karno)

#* Default diagtime model is not ideal
fit.diagtime <- lm(diagtime ~ karno + prior, data = veteran)
plot(fit.diagtime)

#* Log diagtime model is better
fit.logdiag <- lm(log(diagtime) ~ karno + prior, data = veteran)
plot(fit.logdiag)

#* Restore plot settings
par(mfrow = c(1, 1))

For convenience, we will create a variable for the log-diagnostic time in our data set and reconstruct the network appropriately.

veteran %<>%
  mutate(log_diagtime = log(diagtime))

Lung <- HydeNetwork(~ log_diagtime | prior * karno + 
                      age + 
                      celltype + 
                      karno | celltype + 
                      trt | celltype * karno * log_diagtime * age * prior + 
                      time | trt * celltype * karno * log_diagtime * age * prior,
                    data = veteran)

plot(Lung)

First, we define the model for time, as was done previously.

Lung %<>%
  setNode(time, 
          nodeType = "dnorm",
          nodeFitter = "survreg",
          nodeFormula = Surv(time, status) ~ trt + celltype + karno + 
                          log_diagtime + age + prior ,
          fitterArgs = list(dist = "weibull", scale = 1),
          mu = fromData(),
          tau = fromData())

Lastly, since we have assumed the decision to treat is based on physician judgment, we will declare trt a decision node so that we can evaluate if patients would do better on one treatment or another.

Lung %<>%
  setDecisionNodes(trt)

plot(Lung)

Our next step is to estimate the survival time for the first patient in the veteran data set.

Post <- 
  Lung %>%
  compileDecisionModel(data = list(celltype = "squamous", 
                                   karno = 60, 
                                   log_diagtime = 1.94591,
                                   age = 69,
                                   prior = "no")) %>%
  HydePosterior(variable.names=c("trt", "time"),
                n.iter = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 0
##    Unobserved stochastic nodes: 7
##    Total graph size: 305
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 6
##    Unobserved stochastic nodes: 1
##    Total graph size: 304
## 
## Initializing model
## 
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 6
##    Unobserved stochastic nodes: 1
##    Total graph size: 303
## 
## Initializing model
ggplot(Post,
       mapping = aes(x = trt,
                     y = time, 
                     colour = trt)) + 
  geom_boxplot()

It appears that survival time isn’t much different by treatment. This isn’t a terribly surprising result, considering the results of the survival model itself. The hazard ratio for treatment indicates only a 19.7% reduction in hazard for patients on the test treatment. The celltype appears to have a greater impact on survival than the treatment.

dust(fit,
     descriptors = c("term_plain", "level"),
     caption = "Summary of Survival Model") %>%
  sprinkle(cols = c(3, 7, 8), 
           fn = quote(exp(value))) %>%
  sprinkle(cols = 3:8,
           halign = "right") %>%
  sprinkle(cols = 1:2,
           rows = 1,
           replace = c("(Intercept)", "")) %>%
  sprinkle_colnames("Term", "Level", "HR", "Std. Error", 
                    "Z", "P-value", "LCL", "UCL") %>%
  medley_model() %>%
  hyde_medley()

Table 4: Summary of Survival Model
Term Level HR Std. Error Z P-value LCL UCL
(Intercept) 24.25 0.7 4.53 < 0.001 6.1 96.4
trt test 0.8 0.2 -1.11 0.27 0.54 1.19
celltype smallcell 0.44 0.26 -3.13 0.002 0.26 0.74
celltype adeno 0.33 0.28 -4.04 < 0.001 0.19 0.56
celltype large 0.69 0.27 -1.38 0.17 0.4 1.17
karno 1.03 0.01 6 < 0.001 1.02 1.04
diagtime 1 0.01 -0.03 0.97 0.98 1.02
age 1.01 0.01 0.67 0.5 0.99 1.02
prior yes 0.95 0.23 -0.22 0.83 0.61 1.48


Out of curiosity, let’s modify the network again to explore the hypothesis that the new treatment may benefit some cell types better than others. To do this, we’ll declare celltype a decision node.

Lung %<>%
  setDecisionNodes(celltype)

plot(Lung)

Post2 <- 
  Lung %>%
  compileDecisionModel() %>%
  HydePosterior(variable.names = "time",
                n.iter = 1000)
ggplot(Post2,
       mapping = aes(x = trt,
                     y = time,
                     colour = trt)) + 
  geom_boxplot() + 
  facet_grid(~ celltype)

Defining the Network with Model Objects

As with other HydeNet network objects, it is permissible to define network nodes using R model objects. Suppose we constructed our models for the network before building the network.

The representation below renders differently than the earlier version of the network. It is, in fact, the same network, but the nodes are placed differently due to the order in which nodes are identified and renderd by GraphViz.

fit <- survreg(Surv(time, status) ~ trt + celltype + karno + 
                  log_diagtime + age + prior,
               data = veteran,
               dist = "weibull",
               scale = 1)
fit.logdiag <- lm(log_diagtime ~ karno + prior, data = veteran)

LungMod <- 
  HydeNetwork(list(fit, fit.logdiag)) %>%
  update( ~ . + 
            karno | celltype + 
            trt | age * prior * log_diagtime * celltype * karno) %>%
  setDecisionNodes(trt, celltype)

plot(LungMod)

The structure of the network is the same and any subsequent analyses (obtaining the posterior distribution and the summaries that follow) are done using the same methods described above.

print(LungMod)
## A Probabilistic Graphical Network
## Has data attached: No
## 
## karno | celltype
## dnorm(mu = Unspecified, tau = Unspecified)
## : karno ~ 1
## 
## celltype
## dnorm(mu = Unspecified, tau = Unspecified)
## : celltype ~ 1
## 
## trt | karno * celltype * age * prior * log_diagtime
## dnorm(mu = Unspecified, tau = Unspecified)
## : trt ~ 1
## 
## age
## dnorm(mu = Unspecified, tau = Unspecified)
## : age ~ 1
## 
## prior
## dnorm(mu = Unspecified, tau = Unspecified)
## : prior ~ 1
## 
## log_diagtime | karno * prior
## dnorm(mu =  -0.00596*karno + 0.96735*prior + 1.81978, tau = 1.3261374752273)
## lm: log_diagtime ~ karno + prior
## 
## time | karno * celltype * trt * age * prior * log_diagtime
## dnorm(mu =  exp(0.0063*age + -1.11364*(celltype==3) + -0.38005*(celltype==4) + -0.82503*(celltype==2) + 0.03076*karno + 0.0178*log_diagtime + -0.07181*(prior==2) + -0.21729*(trt==2) + 3.1433), tau = 1 / (xmat.time[,] %*% vv.time[,] %*% xmatprime.time[,]))
## survreg: Surv(time, status) ~ trt + celltype + karno + log_diagtime + age + prior